EDA
Now that we have presented the variables in the dataset, let’s examine its structure, characteristics and underlying patterns through an exploratory data analysis (EDA).
Dataset overview
Columns description
Let’s have a quick look at the characteristics of the columns. More detailed statistics are available in the appendix.
| Name | Number_of_rows | Number_of_columns | Character | Numeric | Group_variables |
|---|---|---|---|---|---|
| data | 45896 | 26 | 8 | 18 | None |
The dataset we are working with contains approximately 46,000 rows and 26 columns, each row representing a model from one of the 141 brands. From the data overview, we can see that most of our features concern the consumption of the cars. Looking at the details in the appendix, we notice that some variables contain many missing values and that the variable “Time.to.Charge.EV..hours.at.120v.” contains only 0s. We will handle these issues in the section “Data cleaning”.
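The two checks mentioned above (the share of missing values and columns that contain only 0s) can be sketched as follows. This is a minimal illustration using only the Python standard library, not the code used to produce the overview table. The rows are invented toy data, and the column names `Make` and `Engine.Cylinders` are assumptions; only “Time.to.Charge.EV..hours.at.120v.” is taken from the dataset.

```python
# Toy rows standing in for the real dataset; every value here is invented.
# None marks a missing value. "Make" and "Engine.Cylinders" are assumed names.
ROWS = [
    {"Make": "Chevrolet", "Engine.Cylinders": 8, "Time.to.Charge.EV..hours.at.120v.": 0},
    {"Make": "Ford", "Engine.Cylinders": 6, "Time.to.Charge.EV..hours.at.120v.": 0},
    {"Make": "Toyota", "Engine.Cylinders": None, "Time.to.Charge.EV..hours.at.120v.": 0},
    {"Make": "Honda", "Engine.Cylinders": 4, "Time.to.Charge.EV..hours.at.120v.": 0},
]


def column_summary(rows):
    """Per column: share of missing values, and whether every present value is 0."""
    summary = {}
    for col in rows[0]:
        values = [row.get(col) for row in rows]
        present = [v for v in values if v is not None]
        summary[col] = {
            "missing_share": 1 - len(present) / len(values),
            "all_zero": bool(present) and all(v == 0 for v in present),
        }
    return summary


for name, stats in column_summary(ROWS).items():
    print(name, stats)
```

A column flagged with `all_zero` carries no information and is a natural candidate for removal during data cleaning.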
Exploration of the distribution
Now let’s explore the distribution of the numerical features.


Because most models in our dataset are neither electric vehicles (EVs) nor hybrids, and because some columns apply only to these two types of vehicles, several columns contain a large number of zero values. This issue will be addressed during the data cleaning process. Additionally, certain features, such as “Engine Cylinders”, are discrete rather than continuous, as illustrated in the corresponding plot.
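Both observations can be quantified without a plot: the share of zeros in a column, and the number of distinct values it takes. The sketch below is only illustrative. The two lists are invented, and `ev_only_feature` is a hypothetical stand-in for any column that applies only to EVs or hybrids.

```python
def zero_share(values):
    """Fraction of entries that are equal to 0."""
    return sum(1 for v in values if v == 0) / len(values)


def distinct_count(values):
    """Number of distinct values; a small count suggests a discrete feature."""
    return len(set(values))


# Invented toy columns, for illustration only.
ev_only_feature = [0, 0, 0, 0, 0, 0, 3.5, 0]
engine_cylinders = [4, 4, 6, 8, 4, 6, 8, 4]

print(zero_share(ev_only_feature))       # share of zeros in the EV-only column
print(distinct_count(engine_cylinders))  # number of distinct cylinder counts
```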
Outlier Detection
To identify observations that deviate significantly from the rest, and to potentially improve the overall quality of the data, we analysed outliers using boxplots. Here are the results for the numerical columns of the dataset:

Most of our boxplots show extreme values. Again, this is due to the small number of EVs and hybrid cars in our dataset compared to the rest of the models, and to the fact that some features apply only to those types of vehicles.
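To see why zero-heavy columns produce so many extreme values, it helps to write down the rule a boxplot conventionally uses. The sketch below assumes the usual 1.5 × IQR whisker rule, which is the default in most plotting libraries but is an assumption here, since the report does not state it. The data are invented.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Invented column where nine of ten vehicles have no EV-specific value.
mostly_zero = [0] * 9 + [25]

# Q1 and Q3 are both 0, so the IQR is 0 and any non-zero value is flagged.
print(iqr_outliers(mostly_zero))
```

When more than three quarters of a column are zeros, the box collapses onto 0 and every EV or hybrid value is drawn as an outlier, even though it is a perfectly valid measurement.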
Number of models per make
Now let’s check how many models we have per make in our dataset. To keep the plot readable, we show only the top 20 makes on the graph. All the remaining makes are listed in the table just below.

Of the 141 brands, only 13 have more than 1000 models in the dataset. Among these, only one (Chevrolet) has more than 4000 models. In addition, as shown in the appendix, many car brands have only 1 observation, which creates class imbalance and could lead to bias toward the majority classes. We will therefore address this issue later.
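The counts discussed above (the top makes and the makes with a single observation) can be obtained with a simple frequency table. The following is a minimal sketch using the standard library. The list of makes is invented toy data, and `make_counts` is a hypothetical helper, not the function used to build the plot.

```python
from collections import Counter


def make_counts(makes, top_n=20):
    """Return the top_n most frequent makes and the makes seen only once."""
    counts = Counter(makes)
    top = counts.most_common(top_n)
    singletons = sorted(make for make, n in counts.items() if n == 1)
    return top, singletons


# Invented toy data: one entry per model, holding the model's make.
MAKES = ["Chevrolet"] * 3 + ["Ford"] * 2 + ["Toyota"]

top, singletons = make_counts(MAKES, top_n=2)
print(top)         # most frequent makes with their model counts
print(singletons)  # makes with a single observation
```

On the real data, `top_n=20` would reproduce the selection used for the plot, and the length of `singletons` would give the number of makes with only one observation.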
